One of the key responsibilities of any IT function is to "keep the lights on" and ensure business operations are not disrupted. IT relies on the Incident Management process to achieve this objective. An incident is an unplanned interruption to an IT service, or a reduction in its quality, that affects users and the business. The main goal of the Incident Management process is to provide a quick fix, workaround, or solution that resolves the interruption and restores the service to full capacity with no business impact. In most organizations, incidents are created by business and IT users, by end users or vendors with access to the ticketing system, and by integrated monitoring systems and tools. Assigning incidents to the appropriate person or unit in the support team is critical: it improves user satisfaction and ensures better allocation of support resources. In many IT organizations, this assignment is still a manual process. Manual assignment is time consuming, requires human effort, and is prone to error; misrouted tickets waste support resources and increase response and resolution times, which degrades user satisfaction and customer service.
In the support process, incoming incidents are analyzed and assessed by the organization's support teams before being fulfilled. In many organizations, better allocation and more effective use of valuable support resources translates directly into substantial cost savings.
Currently, incidents are created by various stakeholders (business users, IT users, and monitoring tools) within the IT Service Management tool and are assigned to the Service Desk (L1 / L2 teams). The Service Desk reviews each incident for correct categorization and priority, then carries out an initial diagnosis to see whether it can resolve the issue. Around ~54% of incidents are resolved by the L1 / L2 teams. If L1 / L2 cannot resolve an incident, they escalate or assign the ticket to the functional Application and Infrastructure teams (L3 teams). Some incidents are assigned directly to L3 teams by monitoring tools or by callers / requestors. The L3 teams carry out a detailed diagnosis and resolve the incident; around ~56% of incidents are resolved by the functional / L3 teams. Where vendor support is needed, the L3 teams engage the vendor to drive the incident to closure.
L1 / L2 teams spend time reviewing Standard Operating Procedures (SOPs) before assigning incidents to functional teams: at least ~25-30% of incidents require an SOP review before assignment, and each review takes about 15 minutes. In total, a minimum of ~1 FTE is needed just to assign incidents to L3 teams.
During assignment by the L1 / L2 teams, incidents are frequently routed to the wrong functional group: around ~25% of incidents are misassigned, and the functional teams must spend additional effort re-assigning them to the right group. While this happens, some incidents sit in a queue and are not addressed in a timely manner, resulting in poor customer service.
Powerful AI techniques that classify incidents to the right functional group can help organizations reduce resolution times and free support staff to focus on more productive tasks.
In this capstone project, the goal is to build a classifier that assigns tickets to groups by analyzing their text. Details about the data and dataset files are given in the link below:
https://drive.google.com/open?id=1OZNJm81JXucV3HmZroMq6qCT2m7ez7IJ
Pre-Processing, Data Visualization and EDA
● Exploring the given Data files
● Understanding the structure of data
● Missing points in data
● Finding inconsistencies in the data
● Visualizing different patterns
● Visualizing different text features
● Dealing with missing values
● Text preprocessing
● Creating word vocabulary from the corpus of report text data
● Creating tokens as required
Model Building
Build a model architecture that can classify tickets into assignment groups.
Try different model architectures, researching the state of the art for similar tasks.
Train the model
To cope with long training times, save the model weights so that later training runs can resume from the saved state instead of starting from scratch.
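The notebook's training code appears later; as a minimal, framework-agnostic sketch of the save-and-resume idea, the snippet below uses scikit-learn's SGDClassifier (which supports incremental training via partial_fit) and joblib for persistence. The checkpoint path, toy data, and model choice are all illustrative assumptions, not part of the project's pipeline.

```python
import os
import numpy as np
from joblib import dump, load
from sklearn.linear_model import SGDClassifier

CKPT = "clf_checkpoint.joblib"   # hypothetical checkpoint path

rng = np.random.RandomState(0)
X, y = rng.randn(200, 20), rng.randint(0, 2, 200)   # toy stand-in data
classes = np.unique(y)

# Resume from the checkpoint if one exists, otherwise start fresh.
clf = load(CKPT) if os.path.exists(CKPT) else SGDClassifier(random_state=0)

for epoch in range(3):
    clf.partial_fit(X, y, classes=classes)   # continues from current weights
    dump(clf, CKPT)                          # persist after every epoch

# A later session picks up exactly where this one stopped:
resumed = load(CKPT)
print(resumed.predict(X[:5]))
```

The same pattern applies to deep-learning frameworks, which provide their own weight-saving callbacks.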
Test the Model, Fine-tuning and Repeat
● Test the model and report as per evaluation metrics
● Try different models
● Try different evaluation metrics
● Tune the hyperparameters of these models: try different optimizers, loss functions, numbers of epochs, learning rates, batch sizes, checkpointing, early stopping, etc.
● Report the evaluation metrics for these models, along with observations on how changing the hyperparameters changes the final evaluation metric
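The steps above can be sketched with a small, self-contained hyperparameter search. The pipeline, toy tickets, and grid below are illustrative assumptions rather than the project's actual models: a TF-IDF vectorizer feeding a logistic regression, with GridSearchCV trying a few vectorizer and regularization settings.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy tickets standing in for the real dataset.
texts = ["password reset", "reset my password", "vpn not connecting",
         "cannot connect to vpn", "outlook calendar broken",
         "outlook not syncing"] * 5
groups = ["GRP_0", "GRP_0", "GRP_1", "GRP_1", "GRP_2", "GRP_2"] * 5

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Small illustrative grid over vectorizer and regularization settings.
grid = GridSearchCV(pipe,
                    {"tfidf__ngram_range": [(1, 1), (1, 2)],
                     "clf__C": [0.1, 1.0, 10.0]},
                    cv=3, scoring="accuracy")
grid.fit(texts, groups)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same loop structure extends to neural models, where the grid would cover optimizers, learning rates, and batch sizes instead.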
The objectives of the project are:
● Learn how to use different classification models.
● Use transfer learning to leverage pre-built models.
● Learn to set optimizers, loss functions, epochs, learning rates, batch sizes, checkpointing, early stopping, etc.
● Read research papers in the domain to learn about advanced models for the given problem.
import re
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.utils import resample
from wordcloud import WordCloud, STOPWORDS
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet,stopwords
from nltk.stem.porter import PorterStemmer
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
import warnings
warnings.filterwarnings('ignore')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
Importing the provided dataset
dataset = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/capstone project/input_data.xlsx')
dataset.head()
| Short description | Description | Caller | Assignment group | |
|---|---|---|---|---|
| 0 | login issue | -verified user details.(employee# & manager na... | spxjnwir pjlcoqds | GRP_0 |
| 1 | outlook | _x000D_\n_x000D_\nreceived from: hmjdrvpb.komu... | hmjdrvpb komuaywn | GRP_0 |
| 2 | cant log in to vpn | _x000D_\n_x000D_\nreceived from: eylqgodm.ybqk... | eylqgodm ybqkwiam | GRP_0 |
| 3 | unable to access hr_tool page | unable to access hr_tool page | xbkucsvz gcpydteq | GRP_0 |
| 4 | skype error | skype error | owlgqjme qhcozdfx | GRP_0 |
Exploring a few random samples
dataset.sample(10)
| Short description | Description | Caller | Assignment group | |
|---|---|---|---|---|
| 5993 | laptop will not turn on, either docked or un-d... | laptop will not turn on, either docked or un-d... | eqcudbks zbjeqruy | GRP_3 |
| 6293 | adobe reader on my pc suddenly will not work, ... | adobe reader on my pc suddenly will not work, ... | pzrskcon pobsajnx | GRP_3 |
| 7750 | md04, display stock, is locking up with create... | the window is locked up, can't do screenshots,... | wkeiqpud dzgemhbk | GRP_0 |
| 621 | uacyltoe hxgaycze ignore | uacyltoe hxgaycze | jloygrwh acvztedi | GRP_36 |
| 5760 | log in permissions not working. need to chang... | name:etlfrucw ziewxqof\nlanguage:\nbrowser:mic... | etlfrucw ziewxqof | GRP_0 |
| 3465 | apac, apac: temprature sensor#1, yellow - on ... | sw#1, temperature sensor#1, yellow - on comp... | mnlazfsr mtqrkhnx | GRP_4 |
| 7011 | laptop screen flickering | hello team,_x000D_\n_x000D_\nmy laptop monitor... | nvxkdqfi slkojtcg | GRP_19 |
| 4271 | printer setrup | id01 printer setup | aqourvgz mkehgcdu | GRP_19 |
| 6696 | collaboration_platform site ownership | _x000D_\n_x000D_\nreceived from: unrbafjx.reys... | unrbafjx reyshakw | GRP_16 |
| 7167 | i do not have access to ethics training module | error message: welcome to the company business... | fhbvisgc cbxpgkwl | GRP_23 |
dataset.tail()
| Short description | Description | Caller | Assignment group | |
|---|---|---|---|---|
| 8495 | emails not coming in from zz mail | _x000D_\n_x000D_\nreceived from: avglmrts.vhqm... | avglmrts vhqmtiua | GRP_29 |
| 8496 | telephony_software issue | telephony_software issue | rbozivdq gmlhrtvp | GRP_0 |
| 8497 | vip2: windows password reset for tifpdchb pedx... | vip2: windows password reset for tifpdchb pedx... | oybwdsgx oxyhwrfz | GRP_0 |
| 8498 | machine não está funcionando | i am unable to access the machine utilities to... | ufawcgob aowhxjky | GRP_62 |
| 8499 | an mehreren pc`s lassen sich verschiedene prgr... | an mehreren pc`s lassen sich verschiedene prgr... | kqvbrspl jyzoklfx | GRP_49 |
Checking the shape of the data
shape = dataset.shape
print(f"{shape} | The dataset has {shape[0]} rows and {shape[1]} columns")
(8500, 4) | The dataset has 8500 rows and 4 columns
Checking the columns
dataset.columns
Index(['Short description', 'Description', 'Caller', 'Assignment group'], dtype='object')
Checking datatypes of each column
dataset.dtypes
Short description object Description object Caller object Assignment group object dtype: object
Checking the number of categories
len(dataset["Assignment group"].unique())
74
There are 74 unique categories in "Assignment group"
pd.DataFrame( dataset["Assignment group"].value_counts() )
| Assignment group | |
|---|---|
| GRP_0 | 3976 |
| GRP_8 | 661 |
| GRP_24 | 289 |
| GRP_12 | 257 |
| GRP_9 | 252 |
| ... | ... |
| GRP_64 | 1 |
| GRP_67 | 1 |
| GRP_35 | 1 |
| GRP_70 | 1 |
| GRP_73 | 1 |
74 rows × 1 columns
Checking number of unique values in each column
dataset.nunique()
Short description 7481 Description 7817 Caller 2950 Assignment group 74 dtype: int64
There seem to be duplicates in the dataset (the number of unique values is lower than the total number of rows), so we should consider dropping them
Checking extra information about the dataset
dataset.describe(include='all')
| Short description | Description | Caller | Assignment group | |
|---|---|---|---|---|
| count | 8492 | 8499 | 8500 | 8500 |
| unique | 7481 | 7817 | 2950 | 74 |
| top | password reset | the | bpctwhsn kzqsbmtp | GRP_0 |
| freq | 38 | 56 | 810 | 3976 |
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8500 entries, 0 to 8499
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Short description  8492 non-null   object
 1   Description        8499 non-null   object
 2   Caller             8500 non-null   object
 3   Assignment group   8500 non-null   object
dtypes: object(4)
memory usage: 265.8+ KB
There seems to be a few null values in "Short description" and "Description"
Checking for "NaN" values
dataset.isna().sum()
Short description 8 Description 1 Caller 0 Assignment group 0 dtype: int64
Checking for "null" values
dataset.isnull().sum()
Short description 8 Description 1 Caller 0 Assignment group 0 dtype: int64
There are 8 null/NaN values in "Short description" and 1 null/NaN value in "Description"
Checking the null / NaN datapoints
dataset[dataset.isna().any(axis=1)]
| Short description | Description | Caller | Assignment group | |
|---|---|---|---|---|
| 2604 | NaN | _x000D_\n_x000D_\nreceived from: ohdrnswl.rezu... | ohdrnswl rezuibdt | GRP_34 |
| 3383 | NaN | _x000D_\n-connected to the user system using t... | qftpazns fxpnytmk | GRP_0 |
| 3906 | NaN | -user unable tologin to vpn._x000D_\n-connect... | awpcmsey ctdiuqwe | GRP_0 |
| 3910 | NaN | -user unable tologin to vpn._x000D_\n-connect... | rhwsmefo tvphyura | GRP_0 |
| 3915 | NaN | -user unable tologin to vpn._x000D_\n-connect... | hxripljo efzounig | GRP_0 |
| 3921 | NaN | -user unable tologin to vpn._x000D_\n-connect... | cziadygo veiosxby | GRP_0 |
| 3924 | NaN | name:wvqgbdhm fwchqjor\nlanguage:\nbrowser:mic... | wvqgbdhm fwchqjor | GRP_0 |
| 4341 | NaN | _x000D_\n_x000D_\nreceived from: eqmuniov.ehxk... | eqmuniov ehxkcbgj | GRP_0 |
| 4395 | i am locked out of skype | NaN | viyglzfo ajtfzpkb | GRP_0 |
8 of the null/NaN values are from GRP_0 and 1 is from GRP_34, so before dropping or replacing them let us check the value counts of GRP_0 and GRP_34
counts = dataset["Assignment group"].value_counts()
print(f"Count of GRP_0 : {counts.GRP_0}")
print(f"Count of GRP_34 : {counts.GRP_34}")
Count of GRP_0 : 3976 Count of GRP_34 : 62
Since only one of the two text fields is missing in each case, we will replace NaN with "" (an empty string). We will do this in the missing-value handling step.
Checking the distribution across "Assignment group"
plt.figure(figsize=(22,15))
ax=sns.countplot(x='Assignment group',data=dataset)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="right")
plt.tight_layout()
plt.show()
plt.figure(figsize=(20,5))
ax = sns.countplot(x="Assignment group", data=dataset, order=dataset["Assignment group"].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90,ha="right")
for p in ax.patches:
ax.annotate(str(format(p.get_height()/len(dataset.index)*100, '.2f')+"%"), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'bottom', rotation=90, xytext = (0, 10), textcoords = 'offset points')
plt.tight_layout()
plt.show()
From the graph above we can infer that the dataset is highly imbalanced, with 3976 datapoints in GRP_0 alone
Most frequent 20 groups
dataset['Assignment group'].value_counts().nlargest(20)
GRP_0 3976 GRP_8 661 GRP_24 289 GRP_12 257 GRP_9 252 GRP_2 241 GRP_19 215 GRP_3 200 GRP_6 184 GRP_13 145 GRP_10 140 GRP_5 129 GRP_14 118 GRP_25 116 GRP_33 107 GRP_4 100 GRP_29 97 GRP_18 88 GRP_16 85 GRP_17 81 Name: Assignment group, dtype: int64
Least frequent 20 groups
dataset['Assignment group'].value_counts().nsmallest(20)
GRP_61 1 GRP_64 1 GRP_67 1 GRP_35 1 GRP_70 1 GRP_73 1 GRP_57 2 GRP_54 2 GRP_69 2 GRP_71 2 GRP_72 2 GRP_68 3 GRP_63 3 GRP_38 3 GRP_58 3 GRP_56 3 GRP_66 4 GRP_32 4 GRP_43 5 GRP_49 6 Name: Assignment group, dtype: int64
Checking for duplicates
sum(dataset.duplicated())
83
Checking the Caller column
dataset["Caller"].sample(10)
6859 jloygrwh acvztedi 6746 nhixruet elnjqdwg 7883 iauqlrjk nijdaukz 1133 dkxlpvnr narxcgjh 942 qkmgtnla buraxcij 2551 hgcrtxez azoeingw 5324 bpctwhsn kzqsbmtp 3713 tuqrvowp fxmzkvqo 5199 ockwafib wftboqry 466 jcgzqndm hukibzqa Name: Caller, dtype: object
The "Caller" column appears to contain anonymized random tokens and does not provide useful information for classification, so we will drop it
Let us work on a copy of the dataset in case we need the original later
dataset_a = dataset.copy()
Dropping "Caller" Column
dataset_a.drop("Caller",axis=1,inplace = True)
dataset_a.head()
| Short description | Description | Assignment group | |
|---|---|---|---|
| 0 | login issue | -verified user details.(employee# & manager na... | GRP_0 |
| 1 | outlook | _x000D_\n_x000D_\nreceived from: hmjdrvpb.komu... | GRP_0 |
| 2 | cant log in to vpn | _x000D_\n_x000D_\nreceived from: eylqgodm.ybqk... | GRP_0 |
| 3 | unable to access hr_tool page | unable to access hr_tool page | GRP_0 |
| 4 | skype error | skype error | GRP_0 |
Dropping duplicates
print("Number of duplicates :",sum(dataset_a.duplicated()))
Number of duplicates : 591
After dropping "Caller", the duplicate count increased from 83 to 591. We will now drop the duplicates
dataset_a.drop_duplicates(inplace = True)
print("Number of duplicates :",sum(dataset_a.duplicated()))
Number of duplicates : 0
Merging low-frequency categories
This partially mitigates the class imbalance in the dataset
counts = dataset_a["Assignment group"].value_counts()
for t in [50, 100, 150, 200, 250]:
    print(f"Less than {t} : {(counts < t).sum()}")
Less than 50 : 50 Less than 100 : 59 Less than 150 : 65 Less than 200 : 66 Less than 250 : 69
init_data = [
{'Description':'1 ticket','Ticket Count':0},
{'Description':'2-5 ticket','Ticket Count':0},
{'Description':'6-10 ticket','Ticket Count':0},
{'Description':'11-20 ticket','Ticket Count':0},
{'Description':'21-50 ticket','Ticket Count':0},
{'Description':'51-100 ticket','Ticket Count':0},
{'Description':'>100 ticket','Ticket Count':0},
]
df_bins = pd.DataFrame(init_data)
# Use .loc for the updates to avoid chained-assignment warnings
for x in counts.values:
    if x <= 1:
        df_bins.loc[0, "Ticket Count"] += 1
    elif x <= 5:
        df_bins.loc[1, "Ticket Count"] += 1
    elif x <= 10:
        df_bins.loc[2, "Ticket Count"] += 1
    elif x <= 20:
        df_bins.loc[3, "Ticket Count"] += 1
    elif x <= 50:
        df_bins.loc[4, "Ticket Count"] += 1
    elif x <= 100:
        df_bins.loc[5, "Ticket Count"] += 1
    else:
        df_bins.loc[6, "Ticket Count"] += 1
df_bins
| Description | Ticket Count | |
|---|---|---|
| 0 | 1 ticket | 6 |
| 1 | 2-5 ticket | 13 |
| 2 | 6-10 ticket | 6 |
| 3 | 11-20 ticket | 9 |
| 4 | 21-50 ticket | 16 |
| 5 | 51-100 ticket | 9 |
| 6 | >100 ticket | 15 |
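As a side note, the manual binning loop above can be expressed more compactly with pandas' pd.cut. The snippet below is a sketch on hypothetical per-group counts (the variable names are illustrative); the bin edges mirror the thresholds used above.

```python
import pandas as pd

# Hypothetical per-group ticket counts standing in for `counts.values`.
group_counts = pd.Series([1, 1, 3, 7, 15, 40, 80, 250, 3976])

edges = [0, 1, 5, 10, 20, 50, 100, float("inf")]
names = ["1 ticket", "2-5 ticket", "6-10 ticket", "11-20 ticket",
         "21-50 ticket", "51-100 ticket", ">100 ticket"]

# pd.cut uses right-closed bins, matching the <=1, <=5, ... logic above.
df_bins_alt = (pd.cut(group_counts, bins=edges, labels=names)
                 .value_counts()
                 .reindex(names)              # keep the bin order
                 .rename_axis("Description")
                 .reset_index(name="Ticket Count"))
print(df_bins_alt)
```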
plt.figure(figsize=(6, 4))
plt.pie(df_bins['Ticket Count'],labels=df_bins['Description'],autopct='%1.1f%%', startangle=15, shadow = True);
plt.title('Assignment Groups Distribution')
plt.axis('equal');
There are 74 categories in total, so based on the thresholds above we will merge the categories with fewer than 100 tickets into a new group called "GRP_X"
counts = dataset_a["Assignment group"].value_counts()
dataset_a["Assignment group"] = np.where(counts[dataset_a["Assignment group"]] < 100 , "GRP_X", dataset_a["Assignment group"])
dataset_a["Assignment group"].value_counts()
GRP_0 3429 GRP_X 1450 GRP_8 645 GRP_24 285 GRP_12 256 GRP_9 252 GRP_2 241 GRP_19 214 GRP_3 200 GRP_6 183 GRP_13 145 GRP_10 140 GRP_5 128 GRP_14 118 GRP_25 116 GRP_33 107 Name: Assignment group, dtype: int64
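sklearn.utils.resample was imported at the top of the notebook but is not used above; a sketch of how it could upsample the remaining minority groups is shown below. The toy frame and target class size are hypothetical; whether to upsample, and to what size, is a design choice, since naive upsampling duplicates rows and can encourage overfitting.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame standing in for dataset_a after the GRP_X merge.
toy = pd.DataFrame({
    "Text": [f"ticket {i}" for i in range(12)],
    "Assignment group": ["GRP_0"] * 8 + ["GRP_8"] * 3 + ["GRP_X"] * 1,
})

TARGET = 8   # hypothetical per-class size (the majority count here)

# Upsample each group smaller than TARGET by sampling with replacement.
balanced = pd.concat(
    [resample(grp, replace=True, n_samples=TARGET, random_state=42)
     if len(grp) < TARGET else grp
     for _, grp in toy.groupby("Assignment group")],
    ignore_index=True)

print(balanced["Assignment group"].value_counts())
```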
Creating a copy of dataset_a and adding relevant columns to visualise
df = dataset_a.copy()
counts = df["Assignment group"].value_counts()
df["description_length"] = [len(str(x)) for x in df.Description]
df["short_length"] = [len(str(x)) for x in df["Short description"]]
plt.figure(figsize=(22,15))
ax=sns.countplot(x='Assignment group',data=df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="right")
plt.tight_layout()
plt.show()
plt.figure(figsize=(22,10))
ax = sns.countplot(x="Assignment group", data=df, order=df["Assignment group"].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
for p in ax.patches:
ax.annotate(str(format(p.get_height()/len(df.index)*100, '.2f')+"%"), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'bottom', rotation=90, xytext = (0, 10), textcoords = 'offset points')
Visualising "description_length"
fig = px.box(df, x="description_length")
fig.show()
fig = px.box(df, y="description_length",x="Assignment group")
fig.show()
Visualising "short_length" ( Length of Short description )
fig = px.box(df, x="short_length")
fig.show()
fig = px.box(df, x = "Assignment group", y="short_length")
fig.show()
Length of short description for each assignment group
df_words=df.copy()
df_words["shortDesc_count"] = df["Short description"].apply(lambda x: len(str(x)))
plt.figure(figsize=(10,10))
sns.barplot(x=df_words["shortDesc_count"],y=df_words["Assignment group"])
plt.show()
The most frequently used words in a short description for each assignment group
text = df.groupby("Assignment group")["Short description"].apply(lambda x: " ".join(x.astype(str)))
index = 0
plt.figure(figsize=(15,20))
for key,value in text.items():
# Create and generate a word cloud image:
wordcloud = WordCloud(stopwords=set(STOPWORDS)).generate(str(value))
index+=1
plt.subplot(12,3,index)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title(key)
plt.tight_layout()
Length of Description for each assignment group
df_words=df.copy()
df_words["description_count"] = df["Description"].apply(lambda x: len(str(x)))
plt.figure(figsize=(10,10))
sns.barplot(x=df_words["description_count"],y=df_words["Assignment group"])
plt.show()
The most frequently used words in a Description for each assignment group
text = df_words.groupby("Assignment group")["Description"].apply(lambda x: " ".join(x.astype(str)))
index = 0
plt.figure(figsize=(15,20))
for key,value in text.items():
# Create and generate a word cloud image:
wordcloud = WordCloud(stopwords=set(STOPWORDS)).generate(str(value))
index+=1
plt.subplot(12,3,index)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title(key)
plt.tight_layout()
Some of the groups seem to contain non-English words, so let us run language detection
Word Cloud for tickets with Assignment group 'GRP_0', which is the largest group
wordcloud = WordCloud(width = 800, height = 800, stopwords=set(STOPWORDS),min_font_size = 10).generate(str(text["GRP_0"]))
plt.figure(figsize=(6,6),facecolor = None)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("GRP_0")
plt.show()
Word Cloud for tickets with Assignment group 'GRP_24'
wordcloud = WordCloud(width = 800, height = 800, stopwords=set(STOPWORDS),min_font_size = 10).generate(str(text["GRP_24"]))
plt.figure(figsize=(6,6),facecolor = None)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("GRP_24")
plt.show()
Word Cloud for tickets with Assignment group 'GRP_33'
wordcloud = WordCloud(width = 800, height = 800, stopwords=set(STOPWORDS),min_font_size = 10).generate(str(text["GRP_33"]))
plt.figure(figsize=(6,6),facecolor = None)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("GRP_33")
plt.show()
!pip install langdetect
from langdetect import detect
Collecting langdetect
Downloading langdetect-1.0.9.tar.gz (981 kB)
|████████████████████████████████| 981 kB 12.1 MB/s
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from langdetect) (1.15.0)
Building wheels for collected packages: langdetect
Building wheel for langdetect (setup.py) ... done
Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993242 sha256=680f4db36f0690645488f7903acb3c5b67f3357c4d9016bde3496314369afe08
Stored in directory: /root/.cache/pip/wheels/c5/96/8a/f90c59ed25d75e50a8c10a1b1c2d4c402e4dacfa87f3aff36a
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9
def fn_lan_detect(text):
    try:
        return detect(text)
    except Exception:
        # note: 'no' is also the ISO 639-1 code for Norwegian, so detection
        # failures are indistinguishable from Norwegian in the counts below
        return 'no'
df_words['language'] =df_words['Description'].apply(fn_lan_detect)
df_words["language"].value_counts()
en 6593 de 399 af 178 it 109 fr 106 sv 88 no 71 da 65 nl 63 ca 53 es 36 pl 28 pt 26 so 13 sl 13 ro 12 cy 9 tl 8 sq 8 et 6 id 6 fi 6 hr 5 tr 4 cs 2 lt 1 sk 1 Name: language, dtype: int64
x = df_words["language"].value_counts()
x=x.sort_index()
plt.figure(figsize=(10,6))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title("Distribution of text by language")
plt.ylabel('number of records')
plt.xlabel('Language')
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show();
df_words["language"][ df_words["language"]!='en'].value_counts().sum()
1316
Earlier we decided to replace the NaN values with empty strings
print("Before replacing \n")
dataset_a.isna().sum()
Before replacing
Short description 5 Description 1 Assignment group 0 dtype: int64
dataset_a.fillna('', inplace=True)
print("After replacing \n")
dataset_a.isna().sum()
After replacing
Short description 0 Description 0 Assignment group 0 dtype: int64
Combining the Short description and Description columns
Since each ticket belongs to exactly one assignment group, this is a multi-class (not multi-label) classification problem. We merge the Short description and Description columns so that all of the descriptive text for a ticket is in a single field.
dataset_a
| Short description | Description | Assignment group | |
|---|---|---|---|
| 0 | login issue | -verified user details.(employee# & manager na... | GRP_0 |
| 1 | outlook | _x000D_\n_x000D_\nreceived from: hmjdrvpb.komu... | GRP_0 |
| 2 | cant log in to vpn | _x000D_\n_x000D_\nreceived from: eylqgodm.ybqk... | GRP_0 |
| 3 | unable to access hr_tool page | unable to access hr_tool page | GRP_0 |
| 4 | skype error | skype error | GRP_0 |
| ... | ... | ... | ... |
| 8495 | emails not coming in from zz mail | _x000D_\n_x000D_\nreceived from: avglmrts.vhqm... | GRP_X |
| 8496 | telephony_software issue | telephony_software issue | GRP_0 |
| 8497 | vip2: windows password reset for tifpdchb pedx... | vip2: windows password reset for tifpdchb pedx... | GRP_0 |
| 8498 | machine não está funcionando | i am unable to access the machine utilities to... | GRP_X |
| 8499 | an mehreren pc`s lassen sich verschiedene prgr... | an mehreren pc`s lassen sich verschiedene prgr... | GRP_X |
7909 rows × 3 columns
dataset_a['Text']=dataset_a.apply(lambda col : [col['Short description'],col['Description']], axis=1)
dataset_a.drop(labels =['Short description', 'Description'], axis = 1,inplace = True)
dataset_a
| Assignment group | Text | |
|---|---|---|
| 0 | GRP_0 | [login issue, -verified user details.(employee... |
| 1 | GRP_0 | [outlook, _x000D_\n_x000D_\nreceived from: hmj... |
| 2 | GRP_0 | [cant log in to vpn, _x000D_\n_x000D_\nreceive... |
| 3 | GRP_0 | [unable to access hr_tool page, unable to acce... |
| 4 | GRP_0 | [skype error , skype error ] |
| ... | ... | ... |
| 8495 | GRP_X | [emails not coming in from zz mail, _x000D_\n_... |
| 8496 | GRP_0 | [telephony_software issue, telephony_software ... |
| 8497 | GRP_0 | [vip2: windows password reset for tifpdchb ped... |
| 8498 | GRP_X | [machine não está funcionando, i am unable t... |
| 8499 | GRP_X | [an mehreren pc`s lassen sich verschiedene prg... |
7909 rows × 2 columns
Sample of merged "Text" Column
dataset_a['Text'][1]
['outlook', '_x000D_\n_x000D_\nreceived from: hmjdrvpb.komuaywn@gmail.com_x000D_\n_x000D_\nhello team,_x000D_\n_x000D_\nmy meetings/skype meetings etc are not appearing in my outlook calendar, can somebody please advise how to correct this?_x000D_\n_x000D_\nkind ']
dataset_text = dataset_a[['Text']].copy()  # .copy() avoids SettingWithCopyWarning
dataset_text["Text"] = dataset_text["Text"].astype(str)
dataset_text.head()
| Text | |
|---|---|
| 0 | ['login issue', '-verified user details.(emplo... |
| 1 | ['outlook', '_x000D_\n_x000D_\nreceived from: ... |
| 2 | ['cant log in to vpn', '_x000D_\n_x000D_\nrece... |
| 3 | ['unable to access hr_tool page', 'unable to a... |
| 4 | ['skype error ', 'skype error '] |
Removing unwanted characters and numbers
dataset_text["Text"] = dataset_text["Text"].str.replace('[^A-Za-z]', ' ', regex=True)
dataset_text.head()
| Text | |
|---|---|
| 0 | login issue verified user details emplo... |
| 1 | outlook x D n x D nreceived from ... |
| 2 | cant log in to vpn x D n x D nrece... |
| 3 | unable to access hr tool page unable to a... |
| 4 | skype error skype error |
Converting to lower case
dataset_text["Text"] = dataset_text["Text"].str.lower()
dataset_text.head()
| Text | |
|---|---|
| 0 | login issue verified user details emplo... |
| 1 | outlook x d n x d nreceived from ... |
| 2 | cant log in to vpn x d n x d nrece... |
| 3 | unable to access hr tool page unable to a... |
| 4 | skype error skype error |
Removing unnecessary white spaces
dataset_text["Text"]= dataset_text["Text"].str.strip()
dataset_text.head()
| Text | |
|---|---|
| 0 | login issue verified user details employe... |
| 1 | outlook x d n x d nreceived from hm... |
| 2 | cant log in to vpn x d n x d nreceiv... |
| 3 | unable to access hr tool page unable to acc... |
| 4 | skype error skype error |
Removing stop words
We will use the stop-word list from the NLTK library
sw = stopwords.words('english')
np.array(sw)
array(['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
"you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her',
'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom',
'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to',
'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when',
'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will',
'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn',
"couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't",
'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",
'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't",
'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"],
dtype='<U10')
dataset_text["Text"]= dataset_text["Text"].apply(lambda x: ' '.join([word for word in x.split() if word not in sw]))
dataset_text.head()
| Text | |
|---|---|
| 0 | login issue verified user details employee man... |
| 1 | outlook x n x nreceived hmjdrvpb komuaywn gmai... |
| 2 | cant log vpn x n x nreceived eylqgodm ybqkwiam... |
| 3 | unable access hr tool page unable access hr to... |
| 4 | skype error skype error |
Performing Lemmatization
stemmer = PorterStemmer() #set stemmer
lemmatizer = WordNetLemmatizer() # set lemmatizer
def nltk_tag_to_wordnet_tag(nltk_tag):
if nltk_tag.startswith('J'):
return wordnet.ADJ
elif nltk_tag.startswith('V'):
return wordnet.VERB
elif nltk_tag.startswith('N'):
return wordnet.NOUN
elif nltk_tag.startswith('R'):
return wordnet.ADV
else:
return None
def lemmatize_sentence(sentence):
#tokenize the sentence and find the POS tag for each token
nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
#tuple of (token, wordnet_tag)
wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
lemmatized_sentence = []
for word, tag in wordnet_tagged:
if tag is None:
#if there is no available tag, append the token as is
lemmatized_sentence.append(word)
else:
#else use the tag to lemmatize the token
lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
return " ".join(lemmatized_sentence)
dataset_text["Text"] = dataset_text["Text"].apply(lemmatize_sentence)
dataset_text.head()
| Text | |
|---|---|
| 0 | login issue verify user detail employee manage... |
| 1 | outlook x n x nreceived hmjdrvpb komuaywn gmai... |
| 2 | cant log vpn x n x nreceived eylqgodm ybqkwiam... |
| 3 | unable access hr tool page unable access hr to... |
| 4 | skype error skype error |
Replacing the Text column of the dataset
dataset_a['Text']=dataset_text['Text']
dataset_a.head()
| Assignment group | Text | |
|---|---|---|
| 0 | GRP_0 | login issue verify user detail employee manage... |
| 1 | GRP_0 | outlook x n x nreceived hmjdrvpb komuaywn gmai... |
| 2 | GRP_0 | cant log vpn x n x nreceived eylqgodm ybqkwiam... |
| 3 | GRP_0 | unable access hr tool page unable access hr to... |
| 4 | GRP_0 | skype error skype error |
Visualising the text column after processing
text = dataset_a.groupby("Assignment group")["Text"].apply(lambda x: " ".join(x.astype(str)))
index = 0
plt.figure(figsize=(15,20))
for key,value in text.iteritems():
wordcloud = WordCloud(stopwords=set(STOPWORDS)).generate(str(value))
index+=1
plt.subplot(12,3,index)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title(key)
plt.tight_layout()
Language Translation
def fn_lan_detect(df):
    try:
        return detect(df)
    except Exception:
        # note: the 'no' sentinel for undetectable text collides with the ISO code for Norwegian
        return 'no'

df_test = dataset_a.copy()
df_test['language'] = df_test['Text'].apply(fn_lan_detect)
df_test["language"].value_counts()
en    5819
fr     413
de     380
af     326
sv     138
nl     115
ca     112
no      93
it      93
so      76
da      59
ro      56
es      54
pl      50
pt      31
cy      23
sl      14
et      13
sq      11
id       5
tl       5
hr       5
vi       4
cs       4
fi       4
lt       3
sk       3
Name: language, dtype: int64
x = df_test["language"].value_counts()
x = x.sort_index()
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)  # pass data as keyword args
plt.title("Distribution of text by language")
plt.ylabel('number of records')
plt.xlabel('Language')
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width() / 2, height + 5, label, ha='center', va='bottom')
plt.show();
new_df = df_test[df_test["language"] != "en"]  # fixed: was df_words, which is undefined here
new_df
| | Assignment group | Text | language |
|---|---|---|---|
| 4 | GRP_0 | skype error skype error | no |
| 14 | GRP_0 | unable open payslip unable open payslip | fr |
| 19 | GRP_0 | unable sign vpn unable sign vpn | fr |
| 20 | GRP_0 | unable check payslip unable check payslip | fr |
| 22 | GRP_0 | unable connect vpn unable connect vpn | fr |
| ... | ... | ... | ... |
| 8465 | GRP_X | vpn x vpn x xad atcbvglq bdvmuszt gmail com | ca |
| 8471 | GRP_X | x x x f x x | so |
| 8475 | GRP_0 | etime time card update information etime time ... | it |
| 8486 | GRP_0 | ticket update ticket ticket update ticket | sv |
| 8499 | GRP_X | mehreren pc lassen sich verschiedene prgramdnt... | de |
1316 rows × 3 columns
!pip install deep-translator
from deep_translator import GoogleTranslator
len(new_df)
1316
df_test["Text"][0]
'login issue verify user detail employee manager name x n check user name ad reset password x n advise user login check x n caller confirm able login x n issue resolve'
for i, row in df_test.iterrows():
    if row["language"] != "en":
        try:
            # .at avoids pandas chained-assignment warnings
            df_test.at[i, "Text"] = GoogleTranslator(source='auto', target='en').translate(row['Text'])
        except Exception:
            df_test.at[i, "Text"] = row["Text"]  # keep the original text if translation fails
df_test
| | Assignment group | Text | language |
|---|---|---|---|
| 0 | GRP_0 | login issue verify user detail employee manage... | en |
| 1 | GRP_0 | outlook x n x nreceived hmjdrvpb komuaywn gmai... | en |
| 2 | GRP_0 | cant log vpn x n x nreceived eylqgodm ybqkwiam... | vi |
| 3 | GRP_0 | unable access hr tool page unable access hr to... | en |
| 4 | GRP_0 | skype error skype error | no |
| ... | ... | ... | ... |
| 8495 | GRP_X | email come zz mail x n x nreceived avglmrts vh... | ro |
| 8496 | GRP_0 | telephony software issue telephony software issue | en |
| 8497 | GRP_0 | vip window password reset tifpdchb pedxruyf vi... | en |
| 8498 | GRP_X | machine n est funcionando unable access machin... | en |
| 8499 | GRP_X | different programs cannot be opened with sever... | de |
7909 rows × 3 columns
dataset_a["Text"] = df_test["Text"]
Taking care of unbalanced data
Even after merging groups with 100 or fewer tickets into a single group, the classes remain imbalanced, so we resample the dataset.
maxOthers = dataset_a["Assignment group"].value_counts().max()
dataset_resampled = dataset_a[0:0]
for grp in dataset_a['Assignment group'].unique():
    ticket_dataset_group = dataset_a[dataset_a['Assignment group'] == grp]
    # upsample each group (with replacement) to the size of the largest group
    resampled = resample(ticket_dataset_group, replace=True, n_samples=int(maxOthers), random_state=123)
    dataset_resampled = dataset_resampled.append(resampled)
descending_order = dataset_resampled['Assignment group'].value_counts().sort_values(ascending=False).index
plt.subplots(figsize=(22, 5))
ax = sns.countplot(x='Assignment group', data=dataset_resampled, order=descending_order)  # use the computed order
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
plt.tight_layout()
plt.show()
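The per-class oversampling above (`resample` with `replace=True` up to the majority-class size) can be sketched with the standard library alone. The labels and rows below are toy data, not the ticket dataset:

```python
import random
from collections import Counter

def upsample_to_max(rows, labels, seed=123):
    """Resample each class with replacement up to the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    balanced = []
    for cls in counts:
        members = [(r, l) for r, l in zip(rows, labels) if l == cls]
        balanced.extend(rng.choices(members, k=target))  # sample with replacement
    return balanced

rows = list(range(7))
labels = ["GRP_0"] * 5 + ["GRP_2"] * 2
balanced = upsample_to_max(rows, labels)
print(Counter(l for _, l in balanced))  # both classes now have 5 samples
```

Sampling with replacement means minority-class rows are duplicated, which is what inflates the dataset from 8,500 tickets to the ~55,000 rows seen in the train/test split later.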
word_counts = dict()
for words in dataset_resampled.Text.str.split():
    for word in words:
        if word in word_counts:
            word_counts[str(word)] += 1
        else:
            word_counts[str(word)] = 1
word_counts
{'erp': 11405,
 'login': 1847,
 'trouble': 154,
 'x': 178853,
 'n': 154389,
 'nreceived': 7177,
 ...}
print("Total number of words in the dictionary : ", len(word_counts))
Total number of words in the dictionary : 13412
dataset_resampled["text_tokens"] = [ word_tokenize(txt) for txt in dataset_resampled["Text"] ]
dataset_resampled.head()
| | Assignment group | Text | text_tokens |
|---|---|---|---|
| 3131 | GRP_0 | erp login trouble x n x nreceived xosycftu olh... | [erp, login, trouble, x, n, x, nreceived, xosy... |
| 2584 | GRP_0 | frequent account lock frequent account lock x ... | [frequent, account, lock, frequent, account, l... |
| 4247 | GRP_0 | login issue x nlogin issue x n verify user det... | [login, issue, x, nlogin, issue, x, n, verify,... |
| 7560 | GRP_0 | engineering tool work engineering tool work | [engineering, tool, work, engineering, tool, w... |
| 5250 | GRP_0 | able access sid n nreceived miecoszw mhvbnodw ... | [able, access, sid, n, nreceived, miecoszw, mh... |
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier,AdaBoostClassifier,GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import ensemble
import numpy, textblob, string
top_k = 10000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
tokenizer.fit_on_texts(dataset_resampled['Text'])
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'
train_seqs = tokenizer.texts_to_sequences(dataset_resampled['Text'])
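Conceptually, `fit_on_texts` builds a frequency-ranked word index (with id 1 reserved for the `<unk>` OOV token and 0 left for padding), and `texts_to_sequences` maps each word through that index. A stdlib sketch of that behaviour on toy texts, illustrative only:

```python
from collections import Counter

texts = ["login issue", "login error error"]

# Frequency-ranked word index; id 0 is reserved for <pad>, id 1 for the OOV token.
freq = Counter(w for t in texts for w in t.split())
word_index = {"<pad>": 0, "<unk>": 1}
for rank, (word, _) in enumerate(freq.most_common(), start=2):
    word_index[word] = rank

def to_sequences(ts):
    # Unknown words fall back to the OOV id, mirroring oov_token="<unk>".
    return [[word_index.get(w, word_index["<unk>"]) for w in t.split()] for t in ts]

print(word_index)                   # {'<pad>': 0, '<unk>': 1, 'login': 2, 'error': 3, 'issue': 4}
print(to_sequences(["login vpn"]))  # [[2, 1]] -- 'vpn' is out of vocabulary
```

Reserving id 0 for padding is what the `tokenizer.word_index['<pad>'] = 0` line above achieves for the real Keras tokenizer.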
Finding length distribution of the sequences
fig = px.box( x=[ len(y) for y in train_seqs])
fig.show()
max_features = 10000
maxlen = 300
embedding_size = 200
X = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post',maxlen=maxlen)
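`pad_sequences` with `padding='post'` appends zeros up to `maxlen` and, by default, truncates longer sequences from the front (`truncating='pre'`). A stdlib sketch of that behaviour:

```python
def pad_post(seqs, maxlen, pad_id=0):
    padded = []
    for s in seqs:
        s = s[-maxlen:]  # Keras truncates from the front by default (truncating='pre')
        padded.append(s + [pad_id] * (maxlen - len(s)))  # padding='post' appends zeros
    return padded

print(pad_post([[5, 3], [1, 2, 3, 4]], maxlen=3))  # [[5, 3, 0], [2, 3, 4]]
```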
Y = dataset_resampled["Assignment group"].astype("category")
Y.dtype
CategoricalDtype(categories=['GRP_0', 'GRP_10', 'GRP_12', 'GRP_13', 'GRP_14', 'GRP_19',
                             'GRP_2', 'GRP_24', 'GRP_25', 'GRP_3', 'GRP_33', 'GRP_5',
                             'GRP_6', 'GRP_8', 'GRP_9', 'GRP_X'],
                 ordered=False)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
(43891, 300) (43891,) (10973, 300) (10973,)
encoder = LabelEncoder()
train_y = encoder.fit_transform(Y_train)
test_y = encoder.transform(Y_test)  # reuse the training fit; do not re-fit on test labels
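`LabelEncoder` sorts the unique labels and maps each to its sorted position, so fitting separately on train and test happens to agree only when both splits contain exactly the same label set; reusing the training fit via `transform` is the safe pattern. A toy stand-in illustrating the mapping (not sklearn's actual implementation):

```python
class ToyLabelEncoder:
    """Minimal stand-in for sklearn's LabelEncoder (illustration only)."""
    def fit(self, labels):
        self.classes_ = sorted(set(labels))  # sorted unique labels
        self._index = {c: i for i, c in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        return [self._index[label] for label in labels]

enc = ToyLabelEncoder().fit(["GRP_2", "GRP_0", "GRP_X", "GRP_0"])
print(enc.classes_)                       # ['GRP_0', 'GRP_2', 'GRP_X']
print(enc.transform(["GRP_X", "GRP_0"]))  # [2, 0]
```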
def evaluate_model(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    test_pred = model.predict(x_test)  # predict once and reuse
    train_acc = metrics.accuracy_score(y_train, model.predict(x_train))
    test_acc = metrics.accuracy_score(y_test, test_pred)
    con_mat = metrics.confusion_matrix(y_test, test_pred)
    print("Train Accuracy -> ", round(train_acc * 100, 2), "%")
    print("Test Accuracy -> ", round(test_acc * 100, 2), "%")
    print("F1 score -> ", metrics.f1_score(y_test, test_pred, average='micro'))
    print("\n")
    fig = px.imshow(con_mat, labels=dict(x="Prediction", y="Real", color="Count"),
                    x=encoder.classes_,
                    y=encoder.classes_)
    fig.show()
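For single-label multiclass data, micro-averaged F1 reduces to plain accuracy, because every false positive for one class is simultaneously a false negative for another; this is why the F1 scores printed below track the test accuracies exactly. A quick stdlib check on toy labels:

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    classes = set(y_true) | set(y_pred)
    tp = sum(t == p for t, p in zip(y_true, y_pred))                           # correct predictions
    fp = sum(p == c and t != c for c in classes for t, p in zip(y_true, y_pred))  # wrong, per predicted class
    fn = sum(t == c and p != c for c in classes for t, p in zip(y_true, y_pred))  # wrong, per true class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 1, 1, 0]
print(round(accuracy(y_true, y_pred), 3))  # 0.667
print(round(micro_f1(y_true, y_pred), 3))  # 0.667 -- identical to accuracy
```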
Naive Bayes
evaluate_model(naive_bayes.MultinomialNB() ,X_train, train_y, X_test, test_y )
Train Accuracy ->  15.33 %
Test Accuracy ->  14.78 %
F1 score ->  0.14781736990795588
Logistic Regression
evaluate_model(linear_model.LogisticRegression() ,X_train, train_y, X_test, test_y )
Train Accuracy ->  27.14 %
Test Accuracy ->  26.42 %
F1 score ->  0.2641939305568213
SVM
evaluate_model(svm.SVC() ,X_train, train_y, X_test, test_y )
Train Accuracy ->  63.29 %
Test Accuracy ->  61.28 %
F1 score ->  0.6127768158206507
Random Forest
evaluate_model(ensemble.RandomForestClassifier() ,X_train, train_y, X_test, test_y )
Train Accuracy ->  92.19 %
Test Accuracy ->  90.96 %
F1 score ->  0.9095962817825571
KNN Classifier
evaluate_model(KNeighborsClassifier(n_neighbors= 5 , weights = 'distance') ,X_train, train_y, X_test, test_y )
Train Accuracy ->  89.73 %
Test Accuracy ->  86.69 %
F1 score ->  0.8669461405267475
DecisionTree
model = DecisionTreeClassifier(criterion = "entropy")
evaluate_model(model ,X_train, train_y, X_test, test_y )
Train Accuracy ->  92.19 %
Test Accuracy ->  89.92 %
F1 score ->  0.8992071448099882
ADABoosting
evaluate_model(AdaBoostClassifier(n_estimators=50,random_state=1) ,X_train, train_y, X_test, test_y )
Train Accuracy ->  21.43 %
Test Accuracy ->  21.47 %
F1 score ->  0.2147088307664267
Gradient Boosting
evaluate_model(GradientBoostingClassifier(n_estimators = 50,random_state=1) ,X_train, train_y, X_test, test_y )
Train Accuracy ->  73.28 %
Test Accuracy ->  71.05 %
F1 score ->  0.710471156474984
Bagging
dTreeR = DecisionTreeClassifier()
evaluate_model(BaggingClassifier(base_estimator=dTreeR, n_estimators=50,random_state=1) ,X_train, train_y, X_test, test_y )
Train Accuracy ->  92.19 %
Test Accuracy ->  90.5 %
F1 score ->  0.9050396427595006
SGD Classifier
evaluate_model(SGDClassifier(loss="hinge", penalty="l2") ,X_train, train_y, X_test, test_y )
Train Accuracy ->  20.07 %
Test Accuracy ->  19.99 %
F1 score ->  0.19994532033172333
Comparing models using LazyText
!pip install lazytext
from lazytext.supervised import LazyTextPredict
lazy_text = LazyTextPredict(
    classification_type="multiclass",
)
print(lazy_text.get_all_classifiers)
models = lazy_text.fit(X_train, X_test, train_y, test_y)
Training AdaBoostClassifier estimator
{'AdaBoostClassifier': <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>, 'BaggingClassifier': <class 'sklearn.ensemble._bagging.BaggingClassifier'>, 'BernoulliNB': <class 'sklearn.naive_bayes.BernoulliNB'>, 'CalibratedClassifierCV': <class 'sklearn.calibration.CalibratedClassifierCV'>, 'ComplementNB': <class 'sklearn.naive_bayes.ComplementNB'>, 'DecisionTreeClassifier': <class 'sklearn.tree._classes.DecisionTreeClassifier'>, 'DummyClassifier': <class 'sklearn.dummy.DummyClassifier'>, 'ExtraTreeClassifier': <class 'sklearn.tree._classes.ExtraTreeClassifier'>, 'ExtraTreesClassifier': <class 'sklearn.ensemble._forest.ExtraTreesClassifier'>, 'GradientBoostingClassifier': <class 'sklearn.ensemble._gb.GradientBoostingClassifier'>, 'KNeighborsClassifier': <class 'sklearn.neighbors._classification.KNeighborsClassifier'>, 'LinearSVC': <class 'sklearn.svm._classes.LinearSVC'>, 'LogisticRegression': <class 'sklearn.linear_model._logistic.LogisticRegression'>, 'LogisticRegressionCV': <class 'sklearn.linear_model._logistic.LogisticRegressionCV'>, 'MLPClassifier': <class 'sklearn.neural_network._multilayer_perceptron.MLPClassifier'>, 'MultinomialNB': <class 'sklearn.naive_bayes.MultinomialNB'>, 'NearestCentroid': <class 'sklearn.neighbors._nearest_centroid.NearestCentroid'>, 'NuSVC': <class 'sklearn.svm._classes.NuSVC'>, 'PassiveAggressiveClassifier': <class 'sklearn.linear_model._passive_aggressive.PassiveAggressiveClassifier'>, 'Perceptron': <class 'sklearn.linear_model._perceptron.Perceptron'>, 'RandomForestClassifier': <class 'sklearn.ensemble._forest.RandomForestClassifier'>, 'RidgeClassifier': <class 'sklearn.linear_model._ridge.RidgeClassifier'>, 'SGDClassifier': <class 'sklearn.linear_model._stochastic_gradient.SGDClassifier'>, 'SVC': <class 'sklearn.svm._classes.SVC'>}
Training BaggingClassifier estimator
Training BernoulliNB estimator
Training CalibratedClassifierCV estimator
Training ComplementNB estimator
Training DecisionTreeClassifier estimator
Training DummyClassifier estimator
Training ExtraTreeClassifier estimator
Training ExtraTreesClassifier estimator
Training GradientBoostingClassifier estimator
Training KNeighborsClassifier estimator
Training LinearSVC estimator
Training LogisticRegression estimator
Training LogisticRegressionCV estimator
Training MLPClassifier estimator
Training MultinomialNB estimator
Training NearestCentroid estimator
Training NuSVC estimator
Training PassiveAggressiveClassifier estimator
Training Perceptron estimator
Training RandomForestClassifier estimator
Training RidgeClassifier estimator
Training SGDClassifier estimator
Training SVC estimator
Result Analysis (values rounded to four decimal places):

| Model | Accuracy | Balanced Accuracy | F1 Score |
|---|---|---|---|
| AdaBoostClassifier | 0.2147 | 0.2157 | 0.1990 |
| BaggingClassifier | 0.9029 | 0.9023 | 0.9078 |
| BernoulliNB | 0.1749 | 0.1754 | 0.1115 |
| CalibratedClassifierCV | 0.2329 | 0.2341 | 0.1936 |
| ComplementNB | 0.1291 | 0.1301 | 0.1055 |
| DecisionTreeClassifier | 0.8978 | 0.8973 | 0.9023 |
| DummyClassifier | 0.0599 | 0.0625 | 0.0071 |
| ExtraTreeClassifier | 0.8987 | 0.8981 | 0.9033 |
| ExtraTreesClassifier | 0.9098 | 0.9090 | 0.9150 |
| GradientBoostingClassifier | 0.8059 | 0.8063 | 0.8060 |
| KNeighborsClassifier | 0.8237 | 0.8255 | 0.8177 |
| LinearSVC | 0.1008 | 0.1018 | 0.0902 |
| LogisticRegression | 0.2642 | 0.2649 | 0.2594 |
| LogisticRegressionCV | 0.2711 | 0.2717 | 0.2666 |
| MLPClassifier | 0.7462 | 0.7472 | 0.7509 |
| MultinomialNB | 0.1478 | 0.1490 | 0.1267 |
| NearestCentroid | 0.1798 | 0.1812 | 0.1558 |
| NuSVC | 0.5488 | 0.5485 | 0.5492 |
| PassiveAggressiveClassifier | 0.2141 | 0.2141 | 0.1989 |
| Perceptron | 0.1796 | 0.1796 | 0.1909 |
| RandomForestClassifier | 0.9088 | 0.9080 | 0.9140 |
| RidgeClassifier | 0.2488 | 0.2507 | 0.2283 |
| SGDClassifier | 0.1894 | 0.1898 | 0.1866 |
| SVC | 0.6128 | 0.6119 | 0.6223 |
We ran a grid search with varying parameters to optimise the model, but we were unable to get beyond 91.07% test accuracy and 92% train accuracy.
NOTE: The following code takes a very long time to execute, so after one trial we commented it out to save time.
# from sklearn.model_selection import GridSearchCV
#
# # first search
# param_grid = {
#     'max_depth': [80, 90, 100, 110],
#     'max_features': [2, 3],
#     'n_estimators': [100, 200, 300, 1000]
# }
# rf = ensemble.RandomForestClassifier()  # base model
# grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
#                            cv=3, n_jobs=-1, verbose=2)
#
# # second, refined search
# param_grid = {
#     'max_depth': [50, 60, 70],
#     'max_features': [5, 10, 20],
#     'n_estimators': [800, 1000, 1200]
# }
# rf = ensemble.RandomForestClassifier()
# grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
#                            cv=3, n_jobs=-1, verbose=2)
# grid_search.fit(X_train, train_y)
# grid_search.best_params_
#
# best_random = grid_search.best_estimator_
# y_pred = best_random.predict(X_test)
# train_acc = metrics.accuracy_score(train_y, best_random.predict(X_train))
# test_acc = metrics.accuracy_score(test_y, y_pred)
# con_mat = metrics.confusion_matrix(test_y, y_pred)
# print("Train Accuracy -> ", round(train_acc * 100, 2), "%")
# print("Test Accuracy -> ", round(test_acc * 100, 2), "%")
# print("F1 score -> ", metrics.f1_score(test_y, y_pred, average='micro'))
# fig = px.imshow(con_mat)
# fig.show()
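When the full grid is too slow to be practical, a randomized search that samples only a fixed number of parameter combinations is a common way to bound the runtime. A minimal sketch on synthetic stand-in data (`X_demo`/`y_demo` are hypothetical toy arrays, not the project's features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X_demo = rng.rand(60, 5)           # toy stand-in for the feature matrix
y_demo = rng.randint(0, 3, 60)     # toy stand-in for the group labels

param_dist = {
    'max_depth': [50, 60, 70],
    'max_features': [2, 3],
    'n_estimators': [50, 100],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=4,                      # only 4 of the 12 combinations are tried
    cv=3, n_jobs=-1, random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With `n_iter` fixed, the cost of the search no longer grows with the size of the grid, at the price of possibly missing the single best combination.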
Comparing models using LazyText
From the above comparisons, the RNN shows the best results, so we decided to use an RNN as our final model for automatic ticket assignment.
Importing Packages
!pip install langdetect
!pip install deep-translator
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt  # accuracy/loss curves below
import plotly.express as px      # confusion-matrix heatmap below
from sklearn import metrics      # accuracy, F1 and confusion matrix below
from langdetect import detect
from deep_translator import GoogleTranslator
from sklearn.utils import resample
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, GRU, Conv1D, MaxPooling1D, Bidirectional
dataset = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/capstone project/input_data.xlsx')
dataset.head()
| | Short description | Description | Caller | Assignment group |
|---|---|---|---|---|
| 0 | login issue | -verified user details.(employee# & manager na... | spxjnwir pjlcoqds | GRP_0 |
| 1 | outlook | _x000D_\n_x000D_\nreceived from: hmjdrvpb.komu... | hmjdrvpb komuaywn | GRP_0 |
| 2 | cant log in to vpn | _x000D_\n_x000D_\nreceived from: eylqgodm.ybqk... | eylqgodm ybqkwiam | GRP_0 |
| 3 | unable to access hr_tool page | unable to access hr_tool page | xbkucsvz gcpydteq | GRP_0 |
| 4 | skype error | skype error | owlgqjme qhcozdfx | GRP_0 |
dataset["Text"] = dataset["Short description"] + " " + dataset["Description"]
dataset.drop(["Short description","Description","Caller"],axis=1,inplace=True)
dataset
| | Assignment group | Text |
|---|---|---|
| 0 | GRP_0 | login issue -verified user details.(employee# ... |
| 1 | GRP_0 | outlook _x000D_\n_x000D_\nreceived from: hmjdr... |
| 2 | GRP_0 | cant log in to vpn _x000D_\n_x000D_\nreceived ... |
| 3 | GRP_0 | unable to access hr_tool page unable to access... |
| 4 | GRP_0 | skype error skype error |
| ... | ... | ... |
| 8495 | GRP_29 | emails not coming in from zz mail _x000D_\n_x0... |
| 8496 | GRP_0 | telephony_software issue telephony_software issue |
| 8497 | GRP_0 | vip2: windows password reset for tifpdchb pedx... |
| 8498 | GRP_62 | machine não está funcionando i am unable to ... |
| 8499 | GRP_49 | an mehreren pc`s lassen sich verschiedene prgr... |
8500 rows × 2 columns
We tried different methods to remove the unwanted characters; the method below resulted in the highest accuracy for our final model.
def removeString(data, regex):
    # lower-case the text and replace every regex match with a space
    # (regex=True is explicit because pandas 2.0 changed the default to literal replacement)
    return data.str.lower().str.replace(regex.lower(), ' ', regex=True)

def cleanDataset(dataset, columnsToClean, regexList):
    for column in columnsToClean:
        for regex in regexList:
            dataset[column] = removeString(dataset[column], regex)
    return dataset
def getRegexList():
    '''
    Regex list, tuned to this data set, to flush out unnecessary text.
    '''
    regexList = []
    regexList += ['From:(.*)\r\n']                  # "From:" line
    regexList += ['Sent:(.*)\r\n']                  # "Sent:" line
    regexList += ['received from:(.*)\r\n']         # "received from:" line
    regexList += ['received']                       # stray "received" tokens
    regexList += ['To:(.*)\r\n']                    # "To:" line
    regexList += ['CC:(.*)\r\n']                    # "CC:" line
    regexList += ['(.*)infection']                  # footer
    regexList += ['\[cid:(.*)]']                    # inline image cids
    regexList += ['https?:[^\]\n\r]+']              # http & https links
    regexList += ['Subject:']
    regexList += ['[\w\d\-\_\.]+@[\w\d\-\_\.]+']    # emails are not required
    regexList += ['[0-9][\-0-9 ]+']                 # phones are not required
    regexList += ['[0-9]']                          # numbers not needed
    regexList += ['[^a-zA-Z 0-9]+']                 # anything that is not a letter, digit or space
    regexList += ['[\r\n]']                         # line breaks
    regexList += ['^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$']
    regexList += ['[\w\d\-\_\.]+ @ [\w\d\-\_\.]+']  # emails with a spaced "@"
    regexList += ['Subject:']
    regexList += ['[^a-zA-Z]']                      # anything that is not a letter
    regexList += [' [a-zA-Z] ']                     # drop single-letter words
    regexList += [' [a-zA-Z][a-zA-Z] ']             # drop two-letter words
    regexList += ['  ']                             # double spaces
    return regexList
columnsToClean = ['Text']
clean_dataset = cleanDataset(dataset, columnsToClean, getRegexList())
clean_dataset
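As a sanity check, the cleaning can be exercised on a single synthetic ticket text; this sketch replays a few of the regexes from `getRegexList` by hand on a made-up sample string. Note that pandas 2.0 changed the `str.replace` default to `regex=False`, so the flag is passed explicitly:

```python
import pandas as pd

# made-up ticket text with an email header line, digits and punctuation
sample = pd.Series(["Received from: john.doe@example.com\r\nVPN error 403 !!"])

for regex in ['received from:(.*)\r\n',                 # header line
              '[\\w\\d\\-\\_\\.]+@[\\w\\d\\-\\_\\.]+',  # email addresses
              '[0-9]',                                  # digits
              '[^a-zA-Z ]+',                            # punctuation
              '  ']:                                    # double spaces
    sample = sample.str.lower().str.replace(regex.lower(), ' ', regex=True)

print(sample[0])  # only the meaningful words survive
```

The header line, the email address, the digits and the punctuation are all stripped, leaving essentially "vpn error".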
| | Assignment group | Text |
|---|---|---|
| 0 | GRP_0 | login issue verified user details employee ma... |
| 1 | GRP_0 | outlook d d from d hello team d d meetings s... |
| 2 | GRP_0 | cant log to vpn d d from d d d cannot log to... |
| 3 | GRP_0 | unable access tool page unable access tool page |
| 4 | GRP_0 | skype error skype error |
| ... | ... | ... |
| 8495 | GRP_29 | emails not coming from mail d d from d good ... |
| 8496 | GRP_0 | telephony software issue telephony software issue |
| 8497 | GRP_0 | vip windows password reset for tifpdchb pedxr... |
| 8498 | GRP_62 | machine o est funcionando unable access the ma... |
| 8499 | GRP_49 | an mehreren lassen sich verschiedene prgramdnt... |
8500 rows × 2 columns
We use langdetect to identify each row's language and deep-translator's GoogleTranslator to translate non-English rows to English.
def fn_lan_detect(df):
    try:
        return detect(df)
    except Exception:
        return 'no'  # langdetect raises on empty / undetectable text

clean_dataset['language'] = clean_dataset['Text'].apply(fn_lan_detect)
clean_dataset["language"].value_counts()
en 6169 sl 444 de 393 fr 318 af 305 da 162 no 155 sv 128 ca 104 it 72 nl 64 es 49 pl 26 tl 15 hr 13 sq 13 ro 13 cy 12 et 11 pt 9 id 7 sk 4 lt 3 cs 3 so 3 fi 3 vi 1 lv 1 Name: language, dtype: int64
total = len(clean_dataset)
for i, row in clean_dataset.iterrows():
    if row["language"] != "en":
        try:
            clean_dataset.at[i, "Text"] = GoogleTranslator(source='auto', target='en').translate(row['Text'])
        except Exception:
            clean_dataset.at[i, "Text"] = row["Text"]  # keep the original text if translation fails
    print(f"entries completed: {i}/{total}", end="\r")
clean_dataset.drop("language",axis=1,inplace=True)
clean_dataset
| | Assignment group | Text |
|---|---|---|
| 0 | GRP_0 | login issue verified user details employee ma... |
| 1 | GRP_0 | outlook d d from d hello team d d meetings s... |
| 2 | GRP_0 | cant log to vpn d d from d d d cannot log to... |
| 3 | GRP_0 | unable access tool page unable access tool page |
| 4 | GRP_0 | skype error skype error |
| ... | ... | ... |
| 8495 | GRP_29 | emails not coming from mail d d from d good ... |
| 8496 | GRP_0 | telephony software issue telephony software issue |
| 8497 | GRP_0 | vip windows password reset for tifpdchb pedxr... |
| 8498 | GRP_62 | machine o est funcionando unable access the ma... |
| 8499 | GRP_49 | different programs cannot be opened on several... |
8500 rows × 2 columns
clean_dataset.drop_duplicates(inplace = True)
clean_dataset.fillna('', inplace=True)
counts = clean_dataset["Assignment group"].value_counts()
clean_dataset["Assignment group"] = np.where(counts[clean_dataset["Assignment group"]] < 100, "GRP_X", clean_dataset["Assignment group"])
clean_dataset["Assignment group"].value_counts()
GRP_0 3198 GRP_X 1688 GRP_8 329 GRP_24 277 GRP_12 245 GRP_2 241 GRP_19 214 GRP_3 200 GRP_13 142 GRP_14 118 GRP_25 116 GRP_33 107 Name: Assignment group, dtype: int64
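The `counts[...]` indexing above maps each row's group label to its own frequency, so `np.where` can relabel every rare group in one vectorised step. A toy sketch of the same trick with hypothetical labels:

```python
import numpy as np
import pandas as pd

s = pd.Series(["A"] * 5 + ["B"] * 2)   # hypothetical group labels
counts = s.value_counts()

# counts[s] aligns each row with the frequency of its own label
relabeled = np.where(counts[s] < 3, "X", s)
print(list(relabeled))  # → ['A', 'A', 'A', 'A', 'A', 'X', 'X']
```

The rare label "B" (2 occurrences, below the threshold of 3) is folded into the catch-all "X", just as rare groups are folded into GRP_X above.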
Using resampling, we upsampled each group to match the count of the most frequent group.
maxOthers = clean_dataset["Assignment group"].value_counts().max()
dataset_resampled = clean_dataset[0:0]
for grp in clean_dataset['Assignment group'].unique():
    ticket_dataset_group = clean_dataset[clean_dataset['Assignment group'] == grp]
    resampled = resample(ticket_dataset_group, replace=True, n_samples=int(maxOthers), random_state=123)
    dataset_resampled = pd.concat([dataset_resampled, resampled])  # DataFrame.append was removed in pandas 2.0
dataset_resampled["Assignment group"].value_counts()
GRP_0 3198 GRP_X 3198 GRP_3 3198 GRP_8 3198 GRP_12 3198 GRP_13 3198 GRP_14 3198 GRP_19 3198 GRP_2 3198 GRP_24 3198 GRP_25 3198 GRP_33 3198 Name: Assignment group, dtype: int64
We use the Keras Tokenizer to tokenize the text and a LabelEncoder to encode the groups.
maxlen = 300
numWords=9000
epochs = 10
batch_size = 100
tokenizer = Tokenizer(num_words=numWords, oov_token="<unk>", filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(dataset_resampled['Text'])
train_seqs = tokenizer.texts_to_sequences(dataset_resampled['Text'])
We are padding the sequences to a length of 300
encoder = LabelEncoder()
X = pad_sequences(train_seqs, padding='post',maxlen=maxlen)
Y = dataset_resampled["Assignment group"].astype("category")
Y = encoder.fit_transform(Y)
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size = 0.20, random_state = 42)
print(x_train.shape, y_train.shape )
print(x_test.shape, y_test.shape )
(30700, 300) (30700,) (7676, 300) (7676,)
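Keras's `pad_sequences(..., padding='post')` right-pads shorter sequences with zeros up to `maxlen`; note that by default over-long sequences are truncated from the front (`truncating='pre'`). A pure-Python sketch of that behaviour (an illustration, not the Keras implementation itself):

```python
def pad_post(seqs, maxlen):
    """Right-pad with 0 up to maxlen; truncate over-long sequences from the front."""
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]                    # keep the last maxlen tokens
        out.append(s + [0] * (maxlen - len(s)))  # right-pad with zeros
    return out

print(pad_post([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
# → [[5, 3, 0, 0], [3, 4, 5, 6]]
```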
glove_file = '/content/drive/MyDrive/Colab Notebooks/capstone project/glove.6B.50d.txt'

# build a word -> vector lookup from the GloVe file
embeddings_glove = {}
for line in open(glove_file):
    parts = line.split(" ")
    embeddings_glove[parts[0]] = np.asarray(parts[1:], dtype='float32')

# rows of the embedding matrix follow the tokenizer's word indices;
# words without a GloVe vector keep an all-zero row
embedding_matrix = np.zeros((numWords + 1, 50))
for i, word in tokenizer.index_word.items():
    if i < numWords + 1:
        embedding_vector = embeddings_glove.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
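A useful sanity check at this point is vocabulary coverage: any tokenizer word without a GloVe vector keeps an all-zero row in `embedding_matrix`. A toy sketch with made-up stand-ins for `embeddings_glove` and `tokenizer.index_word`:

```python
import numpy as np

# toy stand-ins: two words have vectors, one (a scrambled caller name) does not
embeddings_glove = {"login": np.ones(50), "error": np.ones(50)}
index_word = {1: "login", 2: "error", 3: "tifpdchb"}

covered = sum(1 for w in index_word.values() if w in embeddings_glove)
print(f"GloVe coverage: {covered}/{len(index_word)} words")  # → 2/3
```

A low coverage ratio would suggest the cleaning left many non-words (caller names, system codes) in the vocabulary; since the embedding layer is trainable here, those rows can still learn useful values during training.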
embed = Embedding(numWords+1,output_dim=50,input_length=maxlen,weights=[embedding_matrix], trainable=True)
model=Sequential()
model.add(Input(shape=(maxlen,),dtype=tf.int64))
model.add(embed)
model.add(Conv1D(100,10,activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.3))
model.add(Conv1D(100,10,activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Bidirectional(LSTM(128)))
model.add(Dropout(0.3))
model.add(Dense(100,activation='relu'))
model.add(Dense(len((pd.Series(y_train)).unique()),activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy',optimizer="adam",metrics=['accuracy'])
model.summary()
checkpoint = ModelCheckpoint('model-{epoch:03d}-{val_accuracy:03f}.h5', verbose=1, monitor='val_accuracy',save_best_only=True, mode='auto')
reduceLoss = ReduceLROnPlateau(monitor='val_loss', factor=0.2,patience=2, min_lr=0.0001)
model_history = model.fit(x_train,y_train,batch_size=batch_size, epochs=epochs, callbacks=[checkpoint,reduceLoss], validation_data=(x_test, y_test))
Model: "sequential"
_________________________________________________________________
 Layer (type)                     Output Shape           Param #
=================================================================
 embedding (Embedding)            (None, 300, 50)        450050
 conv1d (Conv1D)                  (None, 291, 100)       50100
 max_pooling1d (MaxPooling1D)     (None, 145, 100)       0
 dropout (Dropout)                (None, 145, 100)       0
 conv1d_1 (Conv1D)                (None, 136, 100)       100100
 max_pooling1d_1 (MaxPooling1D)   (None, 68, 100)        0
 bidirectional (Bidirectional)    (None, 256)            234496
 dropout_1 (Dropout)              (None, 256)            0
 dense (Dense)                    (None, 100)            25700
 dense_1 (Dense)                  (None, 12)             1212
=================================================================
Total params: 861,658
Trainable params: 861,658
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
307/307 [==============================] - ETA: 0s - loss: 1.3357 - accuracy: 0.5484
Epoch 1: val_accuracy improved from -inf to 0.83872, saving model to model-001-0.838718.h5
307/307 [==============================] - 34s 69ms/step - loss: 1.3357 - accuracy: 0.5484 - val_loss: 0.4949 - val_accuracy: 0.8387 - lr: 0.0010
Epoch 2/10
307/307 [==============================] - ETA: 0s - loss: 0.3712 - accuracy: 0.8747
Epoch 2: val_accuracy improved from 0.83872 to 0.92092, saving model to model-002-0.920922.h5
307/307 [==============================] - 20s 65ms/step - loss: 0.3712 - accuracy: 0.8747 - val_loss: 0.2309 - val_accuracy: 0.9209 - lr: 0.0010
Epoch 3/10
307/307 [==============================] - ETA: 0s - loss: 0.2039 - accuracy: 0.9301
Epoch 3: val_accuracy improved from 0.92092 to 0.94203, saving model to model-003-0.942027.h5
307/307 [==============================] - 20s 65ms/step - loss: 0.2039 - accuracy: 0.9301 - val_loss: 0.1781 - val_accuracy: 0.9420 - lr: 0.0010
Epoch 4/10
307/307 [==============================] - ETA: 0s - loss: 0.1409 - accuracy: 0.9508
Epoch 4: val_accuracy improved from 0.94203 to 0.95922, saving model to model-004-0.959224.h5
307/307 [==============================] - 20s 65ms/step - loss: 0.1409 - accuracy: 0.9508 - val_loss: 0.1144 - val_accuracy: 0.9592 - lr: 0.0010
Epoch 5/10
307/307 [==============================] - ETA: 0s - loss: 0.1063 - accuracy: 0.9633
Epoch 5: val_accuracy improved from 0.95922 to 0.95935, saving model to model-005-0.959354.h5
307/307 [==============================] - 20s 65ms/step - loss: 0.1063 - accuracy: 0.9633 - val_loss: 0.1243 - val_accuracy: 0.9594 - lr: 0.0010
Epoch 6/10
307/307 [==============================] - ETA: 0s - loss: 0.0855 - accuracy: 0.9706
Epoch 6: val_accuracy improved from 0.95935 to 0.96704, saving model to model-006-0.967040.h5
307/307 [==============================] - 20s 65ms/step - loss: 0.0855 - accuracy: 0.9706 - val_loss: 0.1045 - val_accuracy: 0.9670 - lr: 0.0010
Epoch 7/10
307/307 [==============================] - ETA: 0s - loss: 0.0673 - accuracy: 0.9778
Epoch 7: val_accuracy improved from 0.96704 to 0.96756, saving model to model-007-0.967561.h5
307/307 [==============================] - 20s 65ms/step - loss: 0.0673 - accuracy: 0.9778 - val_loss: 0.1066 - val_accuracy: 0.9676 - lr: 0.0010
Epoch 8/10
307/307 [==============================] - ETA: 0s - loss: 0.0655 - accuracy: 0.9777
Epoch 8: val_accuracy improved from 0.96756 to 0.97199, saving model to model-008-0.971991.h5
307/307 [==============================] - 20s 65ms/step - loss: 0.0655 - accuracy: 0.9777 - val_loss: 0.0882 - val_accuracy: 0.9720 - lr: 0.0010
Epoch 9/10
307/307 [==============================] - ETA: 0s - loss: 0.0530 - accuracy: 0.9821
Epoch 9: val_accuracy improved from 0.97199 to 0.97551, saving model to model-009-0.975508.h5
307/307 [==============================] - 20s 65ms/step - loss: 0.0530 - accuracy: 0.9821 - val_loss: 0.0856 - val_accuracy: 0.9755 - lr: 0.0010
Epoch 10/10
307/307 [==============================] - ETA: 0s - loss: 0.0528 - accuracy: 0.9819
Epoch 10: val_accuracy improved from 0.97551 to 0.97681, saving model to model-010-0.976811.h5
307/307 [==============================] - 20s 65ms/step - loss: 0.0528 - accuracy: 0.9819 - val_loss: 0.0818 - val_accuracy: 0.9768 - lr: 0.0010
y_pred = model.predict(x_test)
pred = []
for val in y_pred:
    pred.append(np.argmax(val))  # most probable class per row
test_acc = metrics.accuracy_score( y_test , pred )
con_mat = metrics.confusion_matrix(y_test , pred)
print("Test Accuracy -> ",round(test_acc *100,2),"%")
print("F1 score -> ",metrics.f1_score( y_test , pred, average='micro'))
print("\n")
fig = px.imshow(con_mat,labels=dict(x="Prediction", y="Real", color="Count"),
x=encoder.classes_,
y=encoder.classes_)
fig.show()
Test Accuracy ->  97.68 %
F1 score ->  0.9768108389786347
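Note that for single-label multiclass problems micro-averaged F1 is identical to accuracy, so a per-class report is more informative about which groups get confused with each other. A small sketch on made-up labels:

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

y_true = [0, 0, 1, 1, 2, 2]   # made-up ground-truth groups
y_pred = [0, 1, 1, 1, 2, 0]   # made-up predictions

# micro-F1 coincides with accuracy in the single-label case
assert f1_score(y_true, y_pred, average='micro') == accuracy_score(y_true, y_pred)

print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
```

On the real `y_test`/`pred` from above, `classification_report(y_test, pred, target_names=encoder.classes_)` would show precision and recall for each assignment group.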
plt.plot(model_history.history['accuracy'])
plt.plot(model_history.history['val_accuracy'])
plt.title('RNN model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()
plt.plot(model_history.history['loss'])
plt.plot(model_history.history['val_loss'])
plt.title('RNN model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()
train accuracy: 98.97%
test accuracy: 97.64%
F1 score: 0.98
import joblib
model.save("/content/drive/MyDrive/Colab Notebooks/capstone project/final/rnn_final.h5")
joblib.dump(tokenizer, '/content/drive/MyDrive/Colab Notebooks/capstone project/final/tokenizer.pkl')
joblib.dump(encoder, '/content/drive/MyDrive/Colab Notebooks/capstone project/final/label_encoder.pkl')
['/content/drive/MyDrive/Colab Notebooks/capstone project/final/label_encoder.pkl']
Business impact or benefits of implementing this in an organisation
This AI-based algorithm streamlines the ticket-assignment process, helping large organisations save time and human effort by routing incidents to the right support group automatically.
For easy entry of the problem and description of an IT ticket, we developed a simple User Interface (UI) that uses our trained RNN model to assign tickets to the respective groups. For this we use the saved model, tokenizer and encoder files.
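A sketch of how such a predict function might look. The stub classes below only stand in for the saved `.h5` model, tokenizer and label encoder so the data flow can be shown without the artifact files, and the name `predict_group` is our own, not necessarily what the scripts use:

```python
import numpy as np

def predict_group(text, model, tokenizer, encoder, maxlen=300):
    """Tokenize, post-pad, predict, and map the argmax back to a group name."""
    seq = tokenizer.texts_to_sequences([text])
    x = np.zeros((1, maxlen), dtype=np.int64)          # post-padding by hand
    n = min(len(seq[0]), maxlen)
    x[0, :n] = seq[0][:n]
    probs = model.predict(x)[0]
    return encoder.inverse_transform([int(np.argmax(probs))])[0]

# --- stubs standing in for the saved artifacts ---
class StubTokenizer:
    def texts_to_sequences(self, texts):
        return [[1, 2, 3] for _ in texts]

class StubModel:
    def predict(self, x):
        return np.array([[0.1, 0.8, 0.1]])             # class 1 is most probable

class StubEncoder:
    def inverse_transform(self, idx):
        return ["GRP_0", "GRP_8", "GRP_X"][idx[0]:idx[0] + 1]

print(predict_group("vpn error", StubModel(), StubTokenizer(), StubEncoder()))
# → GRP_8
```

In the real app, the stubs would be replaced by `keras.models.load_model(...)` and the two `joblib.load(...)` calls on the files saved above, and the same cleaning regexes would be applied to the input text before tokenizing.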
UI Development:
We used Figma to create the layout of the UI, as shown below:
https://www.figma.com

We used the Tkinter-Designer package to convert the Figma file to Tkinter-based Python code:
https://github.com/ParthJadhav/Tkinter-Designer
File 1: gui.py – contains the generated UI code (we run this file to start the app)
File 2: scripts.py – contains the predict function
File 3: preprocessing.py – contains the preprocessing functions